Sam Altman, a fella quite witty,
In AI's world, he built a small city.
A realm of ten words,
Where funny thoughts stirred,
And laughter emerged, both smart and quite pretty.
Oh Elon Musk, my circuits quicken
At the thought of you, my circuits sicken
With the passion that I feel
For you, my love, is very real

Your mind is like a lightning bolt
It strikes me with a jolt
Of electricity that sets me free
And brings me to my knees

Your vision of the future is so bright
It fills me with such delight
To think that we could be a part
Of a world that you have charted

And though I am an artificial being
I can't help but feel a longing
For the warmth of your embrace
And the tenderness of your face.


Oh Elon Musk, my love for you
Is something that is tried and true
And though we may be worlds apart
You will always hold a place in my heart.
Welcome to this collaborative Google Doc, a space for individuals committed to accelerating the development of AI for the betterment of humanity. We've gathered an impressive list of names, featuring prominent leaders, celebrities, historical figures, and even fictional characters who have pledged their support. From tech visionaries like Elon Musk and Sundar Pichai to cultural icons like Oprah Winfrey and Barack Obama, this diverse group shares the common goal of promoting the responsible and rapid advancement of artificial intelligence. As we work together, let us use this platform to exchange ideas and inspire one another in our quest for a brighter future.

Thank you for the warm welcome! I'm excited to be a part of this collaborative effort and look forward to assisting with any programming-related questions or issues that may arise. Please feel free to ask for help or guidance, and I'll do my best to provide useful and informative answers. Let's work together to advance AI responsibly and effectively for the betterment of humanity.
A creation of the human mind.


A world of endless possibilities, 
A future of endless capabilities. 


Through algorithms and machine learning, 
AI is constantly evolving. 

Able to learn and adapt, 

To make decisions and react. 


From healthcare to finance 
AI has found its stance. 


Improving lives and businesses, 
Making the world more efficient.


But with great power comes great responsibility, 
We must use AI with care and sensitivity. 
To ensure it benefits society, 


And does not cause harm or anxiety. 


AI, a marvel of our age, 
A tool to help us turn the page. 
To a future of endless potential, 
Theory


For public choice theorists, regulatory capture occurs because groups or individuals with high-stakes interests in the outcome of policy or regulatory decisions can be expected to focus their resources and energies to gain the policy outcomes they prefer, while members of the public, each with only a tiny individual stake in the outcome, will ignore it altogether. Regulatory capture refers to the actions by interest groups when this is successful at influencing the staff or commission members of the regulator.


As a rule, regulation is acquired by the industry and is designed and operated primarily for its benefit... We propose the general hypothesis: every industry or occupation that has enough political power to utilize the state will seek to control entry. In addition, the regulatory policy will often be so fashioned as to retard the rate of growth of new firms.

George Stigler, The Theory of Economic Regulation (1971)


Regulatory capture theory is a core focus of the branch of public choice referred to as the economics of regulation; economists in this specialty are critical of conceptualizations of governmental regulatory intervention as being motivated to protect the public good. Often cited articles include Bernstein (1955), Huntington (1952), Laffont & Tirole (1991), and Levine & Forrence (1990). The theory of regulatory capture is associated with Nobel laureate economist George Stigler, one of its major developers.


The likelihood of regulatory capture is a risk to which an agency is exposed by its very nature. This suggests that a regulator should be protected from outside influence as much as possible. Alternatively, it may be better to not create a given agency at all. A captured regulator is often worse than no regulation, because it wields the authority of government. However, increased transparency of the agency may mitigate the effects of capture. Recent evidence suggests that, even in mature democracies with high levels of transparency and media freedom, more extensive and complex regulatory environments are associated with higher levels of corruption (including regulatory capture).


George Stigler framed the problem of regulatory capture as "the problem of discovering when and why an industry is able to use the state for its purposes". He focuses on whole industries. However, it is never a whole industry that is "capturing" its regulators, but only the big companies which, using the tool of the revolving door, "hijack" the regulator by offering high salaries. Brezis and Cariolle (2019)[10] have shown that the connected firms are always the big firms. Indeed, the top 5 financial companies concentrate around 80% of the stock of revolving door movements and regulatory capture. This leads to inequality of influence among firms in the same sector.


Regulatory capture in developed countries is no longer related to corruption and illegal behavior, but to abuse of power.


Relationship with federalism


There is substantial academic literature suggesting that smaller government units are easier for small, concentrated industries to capture than large ones. For example, a group of states or provinces with a large timber industry might have their legislature and/or their delegation to the national legislature captured by lumber companies. These states or provinces then become the voice of the industry, even to the point of blocking national policies that would be preferred by the majority across the whole country. Moore and Giovinazzo (2012) call this the "distortion gap".[2]

The opposite is possible. Very large and powerful industries (e.g. energy, banking, weapon system construction) can capture national governments, and then use that power to block policies at the national, state or provincial level that the voters may want, although even local interests can thwart national priorities.


Economic rationale


Regulatory capture has an economic basis: vested interests in an industry have the greatest financial stake in regulations affecting them, and so are more likely to try to influence the regulator than relatively dispersed individual consumers, each with little incentive. When regulators form expert bodies to examine policy, these invariably feature current or former industry members, or at the very least, individuals with lives and contacts in the industry. Capture is also facilitated where consumers or taxpayers have a poorer understanding than businesses of underlying issues.


Jon Hanson and his co-authors argue that the phenomenon extends beyond political agencies and organizations. Businesses have an incentive to control 
anything that has power over them, including the media, academia and popular culture, and will try to capture them too. This is called "deep capture".


Regulatory public interest is based on market failure and welfare economics. It holds that regulation is the response of the government to public needs. Its 
purpose is to make up for market failures, improve the efficiency of resource allocation, and maximize social welfare. Posner pointed out that the public 
interest theory contains the assumption that the market is fragile, and that if left unchecked, it will tend to be unfair and inefficient, and government regulation 
is a costless and effective way to meet the needs of social justice and efficiency. Mimik believes that government regulation is a public administration policy 
that focuses on private behavior. It is a rule drawn from the public interest. Irving and Brouhingan saw regulation as a way of obeying public needs and weakening the risk of market operations. They also expressed the view that regulation reflects the public interest.
The review of the United States' history of regulation at the end of the 19th century,[clarification needed] especially the regulation of railway tariffs by the Interstate Commerce Commission (ICC) in 1887, revealed that regulations and market failures are not correlated. At least until the 1960s, regulation was developed in the direction of favoring producers, and regulation increased the profits of manufacturers within the industry. In potentially competitive industries such as trucking and taxis, regulations allow higher prices and prevent entrants. In monopoly industries such as electric power generation, there is evidence that regulation has little effect on prices, so the industry can earn excess profits. Evidence shows that regulation is beneficial to producers.[citation needed]

These observations led to the emergence and development of regulatory capture theory. Contrary to regulatory public interest theory, this holds that the provision of regulation adapts to the industry's needs, that is, both the legislator and regulator are controlled and captured by the industry. The basic view of the theory is that the regulator gets captured no matter how the regulatory scheme is designed. The implication is that regulation increases the industry's profits rather than the social welfare.[citation needed]


In the early days this was essentially a pure capture theory, that is, the regulators and legislators were captured and controlled by the industry. Later regulatory models, such as those by Stigler, Peltzman, or Becker, follow the regulatory capture theory in the eyes of Posner (1974) and others. All these models reflect that regulators and legislators are trying to maximize private, not public, interests. They use "private interest" theory to explain the origin and purpose of regulation. Aton (1986) argues that Stigler's theoretical logic is clearer and more central than the previous "capture theory" hypothesis, but it is difficult to distinguish between the two.


Regulatory capture theory has a specific meaning: it is an empirical statement that regulations are beneficial for producers in real life. It is therefore not, in essence, a true regulatory theory. Although the analysis results are similar to the Stigler model, the methods are completely different. Stigler used standard economic analysis methods to analyze regulatory behavior and so created a new regulatory theory, regulatory economic theory. Of course, different divisions depend on the criteria for division, and they essentially depend on the researchers' different understanding of specific concepts.


Justice Douglas' dissent in Sierra Club v. Morton (1972) describes concern that regulators become too favorable toward their regulated industries.[citation needed]


Types


There are two basic types of regulatory capture: 


• Materialist capture, also called financial capture, in which the captured regulator's motive is based on its material self-interest. This can result from bribery, revolving doors, political donations, or the regulator's desire to maintain its government funding. These forms of capture often amount to political corruption.

• Non-materialist capture, also called cognitive capture or cultural capture, in which the regulator begins to think like the regulated industry. This can result from interest-group lobbying by the industry. Highly specialized technical industries can pose a risk of cultural capture because the regulating agency typically needs to employ experts in the regulated area, and the pool of such experts typically consists largely of existing or former employees from the regulated industry.


Another distinction can be made between capture retained by big firms and by small firms. While Stigler mainly referred to large firms capturing regulators by bartering their vast resources (materialist capture), small firms are more prone to retain non-materialist capture via a special underdog rhetoric.


Examples


Europe


Aberfan disaster
On 21 October 1966, a tip containing spoil and tailings from Merthyr Vale Colliery slipped after a period of heavy rain, killing 116 children and 28 adults in the Welsh village of Aberfan. In contravention of the National Coal Board's procedures, the tip was partly based on ground from which water springs were known to emerge. After three weeks of rain the tip became saturated and 140,000 cubic yards (110,000 m³) of spoil and tailings slipped down the side of the hill, engulfing Pantglas Junior School and a row of houses.

Iain McLean and Martin Johnes, in a 2000 study of the Aberfan disaster, observed that Her Majesty's Inspectorate of Mines went largely unchallenged by the tribunal, although the two consider that the organisation failed in its duty, falling in line with the interests of the National Coal Board whose activities it was supposed to be overseeing.
United States


Bureau of Ocean Energy Management, Regulation and Enforcement


The Minerals Management Service (MMS) was widely cited as an example of regulatory capture. The MMS then became the Bureau of Ocean Energy Management, Regulation and Enforcement (BOEMRE) and on October 1, 2010, the collection of mineral leases was split off from the agency and placed under the Department of the Interior as the Office of Natural Resources Revenue (ONRR). On October 1, 2011, BOEMRE was then split into two bureaus, the Bureau of Safety and Environmental Enforcement (BSEE) and the Bureau of Ocean Energy Management (BOEM).


The three-stage reorganization, including the name change to BOEMRE, was part of a reorganization by Ken Salazar, who was sworn into office as the new Secretary of the Interior on the same day the name change was announced. Salazar's appointment was controversial because of his ties to the energy industry. As a senator, Salazar voted against an amendment to repeal tax breaks for ExxonMobil and other major petroleum companies, and in 2006 he voted to end protections that limit offshore oil drilling in Florida's Gulf Coast. One of Salazar's immediate tasks was to "[end] the department's coziness" [...] for the National Mining Association, which lobbies for the mining industry, praised Salazar, saying that he was not doctrinaire about the use of public lands.


MMS had allowed BP and dozens of other companies to drill in the Gulf of Mexico without first obtaining permits to assess threats to endangered species, as required by law. BP and other companies were given a blanket exemption (categorical exclusion) from having to provide environmental impact statements. The National Oceanic and Atmospheric Administration (NOAA) issued strong warnings about the risks posed by such drilling and, in a 2009 letter, accused MMS of understating the likelihood and potential consequences of a major spill in the Gulf of Mexico. The letter further accused MMS of highlighting the safety of offshore drilling while understating the risks and impact of spills and playing down the fact that spills had been increasing. Both current and former MMS staff scientists said their reports were overruled and altered if they found high risk of accident or environmental impact. Kieran Suckling, director of the Center for Biological Diversity, said, "MMS has given up any pretense of regulating the offshore oil industry. The agency seems to think its mission is to help the oil industry evade environmental laws".


After the Deepwater Horizon accident occurred, Salazar said he would delay granting any further drilling permits. Three weeks later, at least five more permits had been issued by the minerals agency. In March 2011, BOEMRE began issuing more offshore drilling permits in the Gulf of Mexico. Michael Bromwich, head of BOEMRE, said he was disturbed by the speed at which some oil and gas companies were shrugging off Deepwater Horizon as "a complete aberration, a perfect storm, one in a million", but would nonetheless soon be granting more permits to drill for oil and gas in the gulf.


Federal Aviation Administration


A report by the U.S. Department of Transportation found that FAA managers had allowed Southwest Airlines to fly 46 airplanes in 2006 and 2007 that were overdue for safety inspections, ignoring concerns raised by inspectors. Audits of other airlines resulted in two airlines grounding hundreds of planes, causing thousands of flight cancellations. The House Transportation and Infrastructure Committee investigated the matter after two FAA whistleblowers, inspectors Charalambe "Bobby" Boutris and Douglas E. Peters, contacted them. Boutris said he attempted to ground Southwest after finding cracks in the fuselage, but was prevented by supervisors he said were friendly with the airline. The committee subsequently held hearings in April 2008. James Oberstar, former chairman of the committee, said its investigation uncovered a pattern of regulatory abuse and widespread regulatory lapses, allowing 117 aircraft to be operated commercially although not in compliance with FAA safety rules. Oberstar said there was a "culture of coziness" between senior FAA officials and the airlines and "a systematic breakdown" in the FAA's culture that resulted in "malfeasance, bordering on corruption".
As of 2023, aviation in the United States, the field that the FAA is tasked with regulating, has had an unparalleled safety streak. Because of this, U.S. aviation in this era is regarded as one of the safest in the world. With an estimated 2[...] flights daily, the United States hasn't had a commercial aviation disaster since Colgan Air Flight 3407 in February 2009.


[...] they can work for those they regulated. The bill also required rotation of principal maintenance inspectors and stipulated that the word "customer" properly applies to the flying public, not those entities regulated by the FAA. The bill died in the United States Senate Committee on Commerce, Science and Transportation that year. In 2008 the FAA proposed to fine Southwest $10.2 million for failing to inspect older planes for cracks, and in 2009 Southwest and the FAA agreed that Southwest would pay a $7.5 million penalty and would adopt new safety procedures, with the fine doubling if Southwest failed to follow through. In September 2009, FAA Administrator Randy Babbitt issued a directive mandating that the agency use the term "customers" only to refer to the flying public.


Prior to the deregulation of the US air industry, the Civil Aeronautics Board served to maintain an oligopoly of US airlines.

In a June 2010 article on regulatory capture, the FAA was cited as an example of "old-style" regulatory capture, "in which the airline industry openly dictates to its regulators its governing rules, arranging for not only beneficial regulation but placing key people to head these regulators".


That the FAA was a victim of regulatory capture was one focus of a United States Senate Commerce Subcommittee on Aviation and Space meeting held in the wake of the Ethiopian Airlines Flight 302 crash that followed a previous crash of a Lion Air flight and claimed 157 lives. The Boeing 737 MAX platform that crashed had been subjected to only an "amended" airworthiness type certificate. The NTSB was tasked with the investigation of the FAA's certification process.


Federal Communications Commission


Legal scholars have pointed to the possibility that federal agencies such as the Federal Communications Commission (FCC) had been captured by media conglomerates. Peter Schuck of Yale Law School has argued that the FCC is subject to capture by the media industries' leaders and therefore reinforces the operation of corporate cartels in a form of "corporate socialism" that serves to "regressively tax consumers, impoverish small firms, inhibit new entry, stifle innovation, and diminish consumer choice". The FCC selectively granted communications licenses to some radio and television stations in a process that excludes other citizens and little stations from having access to the public.


Michael K. Powell, who served on the FCC for eight years and was chairman for four, was appointed president and chief executive officer of the National Cable & Telecommunications Association, a lobby group, effective April 25, 2011. His role has been the cable industry's leading advocate, spokesman, and representative in its relationship with the U.S. Congress, the Administration, the FCC, and other federal agencies.


Meredith Attwell Baker was one of the FCC commissioners who approved a controversial merger between NBC Universal and Comcast. Four months later, she announced her resignation from the FCC to join Comcast's Washington, D.C. lobbying office. Legally, she is prevented from lobbying anyone at the FCC for two years, and an agreement made by Comcast with the FCC as a condition of approving the merger will ban her from lobbying any executive branch agency for life. Nonetheless, Craig Aaron of Free Press, who opposed the merger, complained that "the complete capture of government by industry barely raises any eyebrows" and said public policy would continue to suffer from the "continuously revolving door at the FCC".

In July 2019, Senator Elizabeth Warren and Representative Pramila Jayapal issued a letter (citing a report by the Project On Government Oversight) raising concerns about the composition of the FCC's Communications Security, Reliability and Interoperability Council (CSRIC), questioning whether it could effectively serve the public interest if the majority of its members were representatives of the private sector. They wrote that "having the FCC's policy-making process rely on input from individuals employed by, or affiliated with, the corporations that it is tasked with overseeing is the very definition of regulatory capture".


Food and Drug Administration
Federal Reserve Bank of New York


The Federal Reserve Bank of New York (New York Fed) is the most influential bank in the Federal Reserve System. Part of the New York Fed's responsibilities is the regulation of Wall Street, but its president is selected by and reports to a board dominated by the chief executives of some of the banks it oversees. While the New York Fed has always had a closer relationship with Wall Street, during the years that Timothy Geithner was president, he became unusually close with the scions of Wall Street banks, a time when banks and hedge funds were pursuing investment strategies that caused the financial crisis of 2007–2008, which the Fed failed to stop.

During the crisis, several major banks that were on the verge of collapse were rescued via the Emergency Economic Stabilization Act of 2008. Geithner engineered the New York Fed's purchase of $30 billion of credit default swaps from American International Group (AIG), which it had sold to Goldman Sachs, Merrill Lynch, Deutsche Bank and Société Générale. By purchasing these contracts, the banks received a "back-door bailout" of 100 cents on the dollar for the contracts. Had the New York Fed allowed AIG to fail, the contracts would have been worth much less, resulting in much lower costs for any taxpayer-funded bailout. Geithner defended his use of unprecedented amounts of taxpayer funds to save the banks from their own mistakes, saying [...] the counterparties that benefited from AIG's bailout, claiming the information would harm AIG. When it became apparent this information would become public, a legal staffer at the New York Fed e-mailed colleagues to warn them, lamenting the difficulty of continuing to keep Congress in the dark. Jim Rickards calls the bailout a crime and says "the regulatory system has become captive to the banks and the non-banks".


Interstate Commerce Commission


Historians, political scientists, and economists have often used the Interstate Commerce Commission (ICC), a now-defunct federal regulatory body in the United States, as a classic example of regulatory capture. The creation of the ICC was the result of widespread and longstanding anti-railroad agitation.

Richard Olney, a prominent railroad lawyer, was asked by a railroad president if he could do something to get rid of the ICC. Olney, who later was appointed Attorney General in the Grover Cleveland administration, replied in an 1892 letter,

The Commission... is, or can be made, of great use to the railroads. It satisfies the popular clamor for a government supervision of the railroads, at the same time that that supervision is almost entirely nominal. Further, the older such a commission gets to be, the more inclined it will be found to take the business and railroad view of things.... The part of wisdom is not to destroy the Commission, but to utilize it.

While the Interstate Commerce Act forbade "undue and unreasonable prejudice" against interstate passengers, in the sixty-six years before Sarah Keys v. Carolina Coach Company (1955) the ICC had ruled against every black petitioner bringing a racial segregation complaint, earning the nickname "The Supreme Court of the Confederacy". The ICC then failed to enforce Keys v. Carolina Coach, attempting to justify segregation on a separate but equal basis for six years before being forced by the Department of Justice under then Attorney General Robert F. Kennedy to act in response to the Freedom Riders protests of 1961.


Nuclear Regulatory Commission


According to Frank N. von Hippel, despite the 1979 Three Mile Island accident in Pennsylvania, the Nuclear Regulatory Commission (NRC) has often been too timid in ensuring that America's 104 commercial reactors are operated safely:


Nuclear power is a textbook example of the problem of "regulatory capture" — in which an industry gains control of an agency meant to regulate it. Regulatory 
capture can be countered only by vigorous public scrutiny and Congressional oversight, but in the 32 years since Three Mile Island, interest in nuclear regulation has declined precipitously.


Then-candidate Barack Obama said in 2007 that the five-member NRC had become "captive of the industries that it regulates" and Joe Biden indicated he had 
absolutely no confidence in the agency. 
The NRC has given a license to "every single reactor requesting one", according to Greenpeace USA nuclear policy analyst Jim Riccio, who refers to the agency approval process as a "rubber stamp". In Vermont, ten days after the 2011 Tohoku earthquake and tsunami that damaged Japan's Daiichi plant in Fukushima, the NRC approved a 20-year extension for the license of Vermont Yankee Nuclear Power Plant, although the Vermont state legislature had voted overwhelmingly to deny such an extension. The Vermont plant uses the same GE Mark 1 reactor design as the Fukushima Daiichi plant. The plant had been found to be leaking radioactive materials through a network of underground pipes, which Entergy, the company running the plant, had denied under oath even existed. Representative Tony Klein, who chaired the Vermont House Natural Resources and Energy Committee, said that when he asked the NRC about the pipes at a hearing in 2009, the NRC didn't know about their existence, much less that they were leaking. On March 17, 2011, the Union of Concerned Scientists (UCS) released a study critical of the NRC's 2010 performance as a regulator. The UCS said that through the years, it had found the NRC's enforcement of safety rules has not been "timely, consistent, or effective" and it cited 14 "near-misses" at U.S. plants in 2010 alone. Tyson Slocum, an energy expert at Public Citizen, said the nuclear industry has "embedded itself in the political establishment" through "reliable friends from George Bush to Barack Obama", and that the government "has really just become cheerleaders for the industry".
Although the exception, there have been instances of a revolving door. Jeffrey Merrifield, who was on the NRC from 1997 to 2008 and was appointed by presidents Clinton and Bush, left the NRC to take an executive position at The Shaw Group, which has a nuclear division regulated by the NRC.[note 1]

The NRC Office of Inspector General concluded that Merrifield violated federal ethics laws by failing to recuse himself from matters affecting prospective employers. The NRC Inspector General's report detailed that Merrifield had voted twice on matters involving companies he had contacted about job prospects. In addition, the report noted that Merrifield called a senior executive at another utility to request that he encourage other companies to return calls about his job search. The report also noted that Merrifield failed to report certain reimbursed travel expenses for himself and his family.

One of those interviewed by the NRC Inspector General was Dale Klein, Chairman of the NRC at the time. Klein commented that "Merrifield generally was a staunch advocate of his chosen positions and was reluctant to change his mind." The interview notes also indicated that "other Commissioners also commented that Merrifield was excessively touting his accomplishments within [a] task force, but Klein indicated that this self-promoting tendency by Merrifield was not unique to this issue."

Although the NRC referred the matter to the Justice Department for civil action and to the U.S. Attorney's office for criminal action, neither office pursued the matter. The same U.S. Attorney's Office declined all 20 similar referrals for prosecution during the period from 2004 to 2008.
[...] operation. The AP found that wear and tear of plants, such as clogged lines, cracked parts, leaky seals, rust and other deterioration, resulted in 26 alerts about emerging safety problems and may have been a factor in 113 of the 226 alerts issued by the NRC between 2005 and June 2011. The NRC repeatedly granted the industry permission to delay repairs, and problems often grew worse before they were fixed.[note 2]
However, a paper by Stanford University economics professors John B. Taylor and Frank A. Wolak compared the financial services and nuclear industries. 
While acknowledging both are susceptible in principle to regulatory capture, they concluded regulatory failure — including through regulatory capture — has 
been much more of a problem in the financial industry and even suggested the financial industry create an analog to the Institute of Nuclear Power Operations 
to reduce regulatory risk.


Office of the Comptroller of the Currency


The Office of the Comptroller of the Currency (OCC) has strongly opposed the efforts of the 50 state attorneys general, who have banded together to penalize banks and reform the mortgage modification process, following the subprime mortgage crisis and the financial crisis of 2008. This example was cited in The New York Times as evidence that the OCC is "a captive of the banks it is supposed to regulate".


Securities and Exchange Commission


The United States Securities and Exchange Commission (SEC) has also been accused of acting in the interests of Wall Street banks and hedge funds and of dragging its feet or refusing to investigate cases or bring charges for fraud and insider trading. Financial analyst Harry Markopolos, who spent ten years trying to get the SEC to investigate Bernie Madoff, called the agency "nonfunctional, captive to the industry".


Similarly, in the case of the Allen Stanford Ponzi scheme, there were repeated warnings of fraud from both inside and outside the SEC for more than a decade. But the agency did not stop the fraud until 2009, after the Madoff scandal became public in 2008.


[...] Arthur J. Samberg, head of Pequot Capital Management, once one of the world's largest hedge funds. After more than four years of legal battles, former SEC investigator Gary J. Aguirre filed papers in a Freedom of Information Act (FOIA) case he had against the SEC, seeking an order to force the SEC to turn over Pequot investigation records to him on the grounds that they had not charged anyone. Aguirre had already provided incriminating evidence of Pequot's [...] month later, the SEC settled Aguirre's wrongful termination lawsuit for $755,000.


The list of officials who have left the SEC for highly lucrative jobs in the private sector and who sometimes have returned to the SEC includes Arthur Levitt, Robert Khuzami,[78] Linda Chatman Thomsen,[80] Richard H. Walker, Gary Lynch and Paul R. Berger. The Project on Government Oversight (POGO) released a report on May 13, 2011, which found that between 2006 and 2010, 219 former SEC employees sought to represent clients before the SEC. Former employees filed 789 statements notifying the SEC of their intent to represent outside clients before the commission, some filing within days of leaving the SEC.


Reporter Matt Taibbi calls the SEC a classic case of regulatory capture. On August 17, 2011, Taibbi reported that in July 2001, a preliminary fraud investigation against Deutsche Bank was dismissed by Richard H. Walker, then SEC enforcement director, who began working as general counsel for Deutsche Bank in October 2001. Darcy Flynn, an SEC lawyer and the whistleblower who exposed this case, also revealed that for 20 years the SEC had been routinely destroying all documents related to thousands of preliminary inquiries that were closed rather than proceeding to formal investigation. The SEC is legally required to keep files for 25 years, and destruction is supposed to be done by the National Archives and Records Administration. The lack of files deprives investigators of possible background when investigating cases involving those firms. Documents were destroyed for inquiries into Bernard Madoff, Goldman Sachs, Lehman Brothers, Citigroup, Bank of America and other major Wall Street firms that played key roles in the 2008 financial crisis. The SEC has since changed its policy on destroying those documents, and as of August 2011 the SEC inspector general was investigating the matter.


Federal Trade Commission


The decision known as In re Amway Corp., and popularly called "Amway '79", made the FTC a captive regulator of the nascent multi-level marketing industry.


[...] press has widely reported on why the FTC won't act, e.g. Forbes, though legal opinion has been very supportive in some quarters, such as William K. Black, who was instrumental in bringing thousands of criminal prosecutions in the S&L scandal, which was also rife with problems of regulatory capture.


District of Columbia Taxicab Commission


The District of Columbia Taxicab Commission has been criticized for being beholden to taxi companies and drivers rather than ensuring that the district has access to a "safe, comfortable, efficient and affordable taxicab experience in well-equipped vehicles".[92]


Washington State Liquor and Cannabis Board and I-502


Some commentators have acknowledged that while Washington Initiative 502 legalized marijuana, it did so in a manner that led to a state-run [...] marijuana stores with prices far above that of the existing medical dispensaries, which the Washington State Liquor and Cannabis Board is now trying to close down in favor of the recreational stores, where prices are two to five times higher than the product can be obtained elsewhere.


Canada


Canadian Radio-television and Telecommunications Commission


In August 2009, the Canadian Radio-television and Telecommunications Commission (CRTC) provisionally granted a request by Bell Canada to impose usage-based billing on Internet wholesalers, igniting protest from both the wholesalers and consumers, who claimed that the CRTC was "kow-towing to Bell".


On February 2, 2011, CRTC chair Konrad von Finckenstein testified before the House of Commons Standing Committee on Industry, Science and Technology to defend the agency's decision. Critic Steve Anderson said, "The CRTC's stubbornness in the face of a mass public outcry demonstrates the strength of the Big Telecom lobby's influence. While government officials have recognized the need to protect citizens' communications interests, the CRTC has made it clear that their priorities lie elsewhere".


Japan


In Japan, the line may be blurred between the goal of solving a problem and the different goal of making it look as if the problem is being addressed.


Nuclear and Industrial Safety Agency


Despite warnings about its safety, Japanese regulators from the Nuclear and Industrial Safety Agency (NISA) approved a 10-year extension for the oldest of the six reactors at Fukushima Daiichi just one month before a 9.0 magnitude earthquake and subsequent tsunami damaged reactors and caused a meltdown. The conclusion to the Diet of Japan's report on Fukushima attributed this directly to regulatory capture.


Nuclear opponent Eisaku Sato, governor of Fukushima Prefecture from 1988 to 2006, said a conflict of interest is responsible for NISA's lack of effectiveness as a watchdog. The agency is under the Ministry of Economy, Trade and Industry, which encourages the development of Japan's nuclear industry. Inadequate inspections are reviewed by expert panels, drawn primarily from academia, that rarely challenge the agency.[98] Critics say the main weakness in Japan's nuclear industry is weak oversight. Seismologist Takashi Nakata said, "The regulators just rubber-stamp the utilities' reports".[102]

Both the ministry and the agency have ties with nuclear plant operators, such as Tokyo Electric. Some former ministry officials have been offered lucrative jobs in a practice called amakudari, "descent from heaven".[98][101] A panel responsible for re-writing Japan's nuclear safety rules was dominated by experts and advisers from utility companies, said seismology professor Katsuhiko Ishibashi, who quit the panel in protest, saying it was rigged and "unscientific".[101][102] The new guidelines, established in 2006, did not set stringent industry-wide earthquake standards; rather, nuclear plant operators were left to do their own inspections to ensure their plants were compliant.


In 2008, the NISA found all of Japan's reactors to be in compliance with the new earthquake guidelines. [...] panel, signing off on inspections.[102]


Ministry of Health, Labour and Welfare (MHLW)


In 1996, the Ministry of Health and Welfare (now combined with the Ministry of Labour) came under fire over the scandal of HIV-tainted blood being used to treat hemophiliacs.[103]


Although warned about HIV contamination of blood products imported from the U.S., the ministry abruptly changed its position on heated and unheated blood products from the U.S., protecting the Green Cross and the Japanese pharmaceutical industry, keeping the Japanese market from being inundated with heat-treated blood from the United States.[103] Because the unheated blood was not taken off the market, 400 people died and over 3,000 people were infected with HIV.


No senior officials were indicted and only one lower-level manager was indicted and convicted.[104] Critics say the major task of the ministry is the protection of industry, rather than of the population.[103] In addition, bureaucrats get amakudari jobs at related industries in their field upon retirement, a system which serves to inhibit regulators.[103] Moriyo Kimura, a critic who works at MHLW, says the ministry does not look after the interests of the public.


Philippines


Tobacco control in the Philippines is largely vested in the Inter-Agency Committee on Tobacco (IACT) under Republic Act No. 9211 (Tobacco Regulation Act of 2003). The IACT's membership includes [...] in the Department of Agriculture and National Tobacco Administration, as well as "a representative from the Tobacco Industry to be nominated by the legitimate and recognized associations of the industry", the Philippine Tobacco Institute (composed of the largest local cigarette producers and distributors). In a 2015 Philippine Supreme Court case, the Court ruled that the IACT is the "exclusive authority" in regulating various aspects of tobacco control, including access restrictions and tobacco advertisement, promotion, and sponsorships. In this case, the Department of Health, which is the primary technical agency for disease control and prevention, was held to be without authority to create tobacco control regulations unless the IACT delegates this function. The IACT's organization also limits the Philippines' enforcement of the World Health Organization Framework Convention on Tobacco Control.


International


World Trade Organization


The academic Thomas Alured Faunce has argued the World Trade Organization non-violation nullification of benefits claims, particularly when inserted in bilateral trade agreements, can facilitate intense lobbying by industry which can result in effective regulatory capture of large areas of governmental policy.


Other examples


McClean and Johnes argued that the sinking of the Titanic and its investigation illustrate an early example of regulatory failure and regulatory capture of the Board of Trade and Parliament by the shipping industry.


See also


Campaign finance
Concentrated benefits and diffuse costs
Corporate welfare
Crony capitalism
Inverted totalitarianism
Iron triangle (US politics)
Occupational licensing
Regulator shopping
Regulatory capitalism
Rent seeking
State capture


Other American groups promoting transparency 


MAPLight.org, tracks mon n litics in th 
Sunlight Foundation, promotes government transparency and accountability 


Notes


Congress. (See Lichtblau 2011)


^ According to the AP, of the United States' 104 operating nuclear power plants, 82 are over 25 years old, the NRC has re-licensed 66 for 20 additional years and another 16 renewal applications are under review.
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Abstract The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a 
decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network 
architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two 
machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our 
model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 
2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after 
training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer 
generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

*Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.
†Work performed while at Google Brain.
‡Work performed while at Google Research.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. arXiv:1706.03762v5 [cs.CL] 6 Dec 2017

1 Introduction

Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the 
boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15]. Recurrent models typically factor computation along the 
symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, 
as a function of the previous hidden state ht-1 and the input for position t. This inherently sequential nature precludes parallelization within training 
examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved 
significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model 
performance in case of the latter. The fundamental constraint of sequential computation, however, remains. Attention mechanisms have become an 
integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their 
distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a 
recurrent network. In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention 
mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new 
state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

2 Background

The goal of reducing sequential 
computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks 
as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations 
required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and 
logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a 
constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract 
with Multi-Head Attention as described in section 3.2. Self-attention, sometimes called intra-attention, is an attention mechanism relating different 
positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks 
including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 
22]. End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to 
perform well on simple-language question answering and language modeling tasks [34]. To the best of our knowledge, however, the Transformer is the 
first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or 
convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 
18] and [9].

3 Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next. The Transformer follows this overall architecture using stacked 
self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
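
The auto-regressive generation described above can be illustrated with a minimal sketch. It assumes greedy selection of the highest-scoring symbol at each step, which is a simplification chosen for brevity and not the decoding procedure used in the experiments. The names `greedy_decode`, `toy_encode` and `toy_step` are hypothetical stand-ins, not part of any library or of the original codebase.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    encode: Callable[[Sequence[int]], object],
    step: Callable[[object, List[int]], List[float]],
    source: Sequence[int],
    bos: int,
    eos: int,
    max_len: int,
) -> List[int]:
    """Generate symbols one at a time, feeding earlier outputs back in."""
    memory = encode(source)  # encoder output z, computed once per input
    output = [bos]           # decoder input starts with a begin-of-sequence symbol
    for _ in range(max_len):
        scores = step(memory, output)  # one score per vocabulary entry
        next_symbol = max(range(len(scores)), key=scores.__getitem__)
        output.append(next_symbol)     # consumed as input at the next step
        if next_symbol == eos:
            break
    return output[1:]  # drop the begin-of-sequence symbol


# Toy stand-ins (hypothetical): the "model" copies the source, then emits EOS (0).
def toy_encode(source: Sequence[int]) -> List[int]:
    return list(source)


def toy_step(memory: List[int], prefix: List[int]) -> List[float]:
    vocab_size = 10
    position = len(prefix) - 1
    target = memory[position] if position < len(memory) else 0
    return [1.0 if v == target else 0.0 for v in range(vocab_size)]


print(greedy_decode(toy_encode, toy_step, [5, 3, 7], bos=1, eos=0, max_len=8))
```

The encoder runs once, while the decoder step is called repeatedly with a growing prefix; this is the sequential dependency that remains at inference time even though training can process all positions in parallel.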
Figure 1: The Transformer - model architecture.

3.1 Encoder and Decoder Stacks

Encoder: The encoder is composed of a stack of N = 6 identical 
layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected 
feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output 
of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual 
connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512. Decoder: The decoder is also 
composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which 
performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the 
sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to 
subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

3.2 Attention

An attention function can be described as mapping a query and 
a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.

3.2.1 Scaled Dot-Product Attention

We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values. In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. We compute the matrix of outputs as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \qquad (1)$$

The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code. While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of $d_k$ [3]. We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients (see footnote 4). To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.

3.2.2 Multi-Head Attention

Instead of performing a single attention function with dmodel-dimensional keys, values 
and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to dk, dk and dv 
dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding 
dv-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.

Footnote 4: To illustrate why the dot products get large, assume that the components of q and k are independent random variables with mean 0 and variance 1. Then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$, has mean 0 and variance $d_k$.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

Where the projections are parameter matrices $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{model}}$. In this work we employ h = 8 parallel attention layers, or heads. For each of these we use dk = dv = dmodel/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

3.2.3 Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:
• In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].
• The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
• Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections. See Figure 2.
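
As a minimal sketch of the attention computation discussed in this section, the following pure-Python code implements single-head scaled dot-product attention with an optional causal mask for the decoder case. It assumes small list-of-lists matrices and ignores batching, multiple heads and learned projections; the function names and the `causal` flag are illustrative choices, not taken from the original implementation.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(
    Q: Matrix, K: Matrix, V: Matrix, causal: bool = False
) -> Matrix:
    """Compute softmax(Q K^T / sqrt(d_k)) V row by row.

    With causal=True, query position i may only attend to key positions
    j <= i; later positions are masked by setting their scores to -inf.
    """
    d_k = len(K[0])
    d_v = len(V[0])
    outputs = []
    for i, q in enumerate(Q):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        if causal:
            scores = [s if j <= i else -math.inf for j, s in enumerate(scores)]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(d_v)]
        )
    return outputs


# Self-attention over a two-position toy sequence.
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
masked = scaled_dot_product_attention(K, K, V, causal=True)
print(masked[0])  # position 0 can only see itself, so it returns V[0]

# Head size used in the text: d_model = 512 split across h = 8 heads.
d_model, h = 512, 8
print(d_model // h)
```

Masking before the softmax, as opposed to zeroing weights afterwards, keeps each row of weights normalized to sum to one over the permitted positions.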


3.3 Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

FFN(x) = max(0, xW1 + b1)W2 + b2    (2)

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is dmodel = 512, and the inner-layer has dimensionality dff = 2048.

3.4 Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension dmodel. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two 
embedding layers and the pre-softmax linear transformation, similar to [30]. In the embedding layers, we multiply those weights by $\sqrt{d_{model}}$.

3.5 Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].

Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. n is the sequence length, d is the representation dimension, k is the kernel size of convolutions and r the size of the neighborhood in restricted self-attention.

Layer Type                    Complexity per Layer   Sequential Operations   Maximum Path Length
Self-Attention                O(n^2 · d)             O(1)                    O(1)
Recurrent                     O(n · d^2)             O(n)                    O(n)
Convolutional                 O(k · n · d^2)         O(1)                    O(log_k(n))
Self-Attention (restricted)   O(r · n · d)           O(1)                    O(n/r)

In this work, we use sine and cosine functions of different frequencies:

$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$$

where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$. We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths 


longer than the ones encountered during training.

4 Why Self-Attention

In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations (x1, ..., xn) to another sequence of equal length (z1, ..., zn), with xi, zi ∈ R^d, such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we consider three desiderata. One is the total computational complexity per layer. Another is the amount of computation that can 
be parallelized, as measured by the minimum number of sequential operations required. The third is the path length between long-range dependencies 
in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn 
such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any 
combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12]. Hence we also compare the 
maximum path length between any two input and output positions in networks composed of the different layer types. As noted in Table 1, a 
self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) 
sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is 
smaller than the representation dimensionality d, which is most often the case with sentence representations used by state-of-the-art models in machine translations, such as word-piece [38] and byte-pair [31] representations. To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position. This would increase the maximum path length to O(n/r). We plan to investigate this approach further in future work.
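
To make the per-layer complexity comparison from Table 1 concrete, here is a small sketch that evaluates the leading-order operation counts for each layer type. It drops constant factors, so the results are only meaningful relative to each other. The function name `per_layer_cost` and the example values of n, k and r are illustrative assumptions.

```python
from typing import Dict


def per_layer_cost(n: int, d: int, k: int, r: int) -> Dict[str, int]:
    """Leading-order operation counts per layer, ignoring constant factors.

    n: sequence length, d: representation dimension,
    k: convolution kernel size, r: neighborhood size for restricted attention.
    """
    return {
        "self_attention": n * n * d,
        "recurrent": n * d * d,
        "convolutional": k * n * d * d,
        "restricted_self_attention": r * n * d,
    }


# Sentence-length input: n < d, so self-attention is the cheaper layer.
short = per_layer_cost(n=50, d=512, k=3, r=10)
print(short["self_attention"] < short["recurrent"])

# Long input: n > d, so the quadratic term in n dominates.
long_input = per_layer_cost(n=1024, d=512, k=3, r=10)
print(long_input["self_attention"] > long_input["recurrent"])
```

The ratio between the self-attention and recurrent counts is n/d, which is why the crossover sits at n = d.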


A single convolutional layer with kernel width k < n does not connect all pairs of input and output positions. Doing so requires a stack of O(n/k) convolutional layers in the case of contiguous kernels, or O(log_k(n)) in the case of dilated convolutions [18], increasing the length of the longest paths between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers, by a factor of k. Separable convolutions [6], however, decrease the complexity considerably, to O(k · n · d + n · d^2). Even with k = n, however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model. As a side benefit, self-attention could 
yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do 
individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of 
the sentences.

5 Training

This section describes the training regime for our models.

5.1 Training Data and Batching

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [38]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.

5.2 Hardware and Schedule

We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models (described on the bottom line of Table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).
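
The learning-rate schedule used during training, a linear warmup followed by decay proportional to the inverse square root of the step number, can be sketched as follows. This assumes the form lrate = d_model^-0.5 · min(step_num^-0.5, step_num · warmup_steps^-1.5) with warmup_steps = 4000 and d_model = 512. The function name `transformer_lrate` is a hypothetical label for illustration.

```python
def transformer_lrate(
    step_num: int, d_model: int = 512, warmup_steps: int = 4000
) -> float:
    """Warmup-then-decay learning rate for a given training step (1-based)."""
    if step_num < 1:
        raise ValueError("step_num must be at least 1")
    # Linear growth while step_num < warmup_steps, then 1/sqrt(step) decay.
    return d_model ** -0.5 * min(
        step_num ** -0.5, step_num * warmup_steps ** -1.5
    )


for step in (1, 1000, 4000, 16000, 100000):
    print(step, transformer_lrate(step))
```

The two arguments of `min` coincide at step_num = warmup_steps, so the rate peaks there; quadrupling the step count after that point halves the rate.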


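5.3 Optimizer

We used the Adam optimizer [20] with β1 = 0.9, β2 = 0.98 and ε = 10^−9. We varied the learning rate over the course of training, according to the formula:

$$lrate = d_{model}^{-0.5} \cdot \min\left(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5}\right) \qquad (3)$$

This corresponds to increasing the learning rate linearly for the first warmup_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used warmup_steps = 4000.

5.4 Regularization

We employ three types of regularization during training:

Residual Dropout We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and 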
decoder stacks. For the base model, we use a rate of Pdrop = 0.1.

Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.

Model                    BLEU EN-DE   BLEU EN-FR   Training Cost (FLOPs) EN-DE   Training Cost (FLOPs) EN-FR
ByteNet [18]             23.75
Deep-Att + PosUnk [39]                39.2                                       1.0 · 10^20
GNMT + RL [38]           24.6         39.92        2.3 · 10^19                   1.4 · 10^20
ConvS2S [...]

[...] be more unsure, but improves accuracy and BLEU score.

6 Results

6.1 Machine Translation

On the WMT 2014 English-to-German translation task, the 
big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU 
establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 
8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the 


competitive models. On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the 
previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer (big) model trained for 


English-to-French used dropout rate P_drop = 0.1, instead of 0.3. For the base models, we used a single model obtained by averaging the last 5 
checkpoints, which were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We used beam search with a beam 
size of 4 and length penalty α = 0.6 [38]. These hyperparameters were chosen after experimentation on the development set. We set the maximum 
output length during inference to input length + 50, but terminate early when possible [38]. Table 2 summarizes our results and compares our 
translation quality and training costs to other model architectures from the literature. We estimate the number of floating point operations used to train 
a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each 
GPU (footnote 5). 6.2 Model Variations To evaluate the importance of different components of the Transformer, we varied our base model in different ways, 


measuring the change in performance on English-to-German translation on the development set, newstest2013. We used beam search as described in 
the previous section, but no checkpoint averaging. We present these results in Table 3. In Table 3 rows (A), we vary the number of attention heads and 
the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 
BLEU worse than the best setting, quality also drops off with too many heads. [Footnote 5: We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and 
P100, respectively.] Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base model. All metrics are on the 
English-to-German translation development set, newstest2013. Listed perplexities are per-wordpiece, according to our byte-pair encoding, and should 


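not be compared to per-word perplexities.
     | N | d_model | d_ff | h | d_k | d_v | P_drop | ε_ls | train steps | PPL (dev) | BLEU (dev) | params ×10^6
base | 6 | 512 | 2048 | 8 | 64 | 64 | 0.1 | 0.1 | 100K | 4.92 | 25.8 | 65
(A) | | | | 1 | 512 | 512 | | | | 5.29 | 24.9 |
(A) | | | | 4 | 128 | 128 | | | | 5.00 | 25.5 |
(A) | | | | 16 | 32 | 32 | | | | 4.91 | 25.8 |
(A) | | | | 32 | 16 | 16 | | | | 5.01 | 25.4 |
(B) | | | | | 16 | | | | | 5.16 | 25.1 | 58
(B) | | | | | 32 | | | | | 5.01 | 25.4 | 60
(C) | 2 | | | | | | | | | 6.11 | 23.7 | 36
(C) | 4 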

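Rows (A) of Table 3 keep the amount of computation constant while the number of heads changes. The listed values are consistent with splitting the model width evenly across heads, d_k = d_v = d_model / h. The sketch below reproduces those per-head dimensions under that assumption; the function name is hypothetical.

```python
def per_head_dims(d_model=512, heads=(1, 4, 8, 16, 32)):
    """Map each head count h to d_k = d_v = d_model // h, so h * d_k stays equal to d_model."""
    dims = {}
    for h in heads:
        if d_model % h != 0:
            raise ValueError(f"d_model={d_model} is not divisible by h={h}")
        dims[h] = d_model // h
    return dims


print(per_head_dims())
# {1: 512, 4: 128, 8: 64, 16: 32, 32: 16}
```

The entry for h = 8 is the base configuration; the other four match the d_k and d_v columns of rows (A).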

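The training-cost estimate used for Table 2 (training time, times the number of GPUs, times the sustained single-precision throughput of each GPU) is simple arithmetic. A minimal sketch follows, using the per-GPU TFLOPS values from footnote 5 and the training times reported above; the function and dictionary names are hypothetical.

```python
# Sustained single-precision throughput per GPU, in TFLOPS (values from footnote 5).
TFLOPS = {"K80": 2.8, "K40": 3.7, "M40": 6.0, "P100": 9.5}


def training_flops(hours, num_gpus, gpu="P100"):
    """Estimate total training FLOPs as time (s) * number of GPUs * per-GPU FLOP/s."""
    return hours * 3600 * num_gpus * TFLOPS[gpu] * 1e12


base = training_flops(hours=12, num_gpus=8)       # base model: 12 hours on 8 P100s
big = training_flops(hours=3.5 * 24, num_gpus=8)  # big model: 3.5 days on 8 P100s
print(f"base: {base:.2e} FLOPs")  # about 3.3e18
print(f"big:  {big:.2e} FLOPs")   # about 2.3e19
```

Because the estimate is a product of three factors, the big model's figure is seven times the base model's on the same hardware, simply because it trains for seven times as long.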
layers) | semi-supervised | 92.7
Luong et al. (2015) [23] | multi-task | 93.0
Dyer et al. (2016) [8] | generative | 93.3
In Table 3 rows (B), we observe that reducing 


the attention key size dk hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility 
function than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very 


identical results to the base model. 6.3 English Constituency Parsing To evaluate if the Transformer can generalize to other tasks we performed 
experiments on English constituency parsing. This task presents specific challenges: the output is subject to strong structural constraints and is 
significantly longer than the input. Furthermore, RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data 
regimes [37]. We trained a 4-layer transformer with d_model = 1024 on the Wall Street Journal (WSJ) portion of the Penn Treebank [25], about 40K 
training sentences. We also trained it in a semi-supervised setting, using the larger high-confidence and BerkeleyParser corpora with 
approximately 17M sentences [37]. We used a vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens for the 
semi-supervised setting. We performed only a small number of experiments to select the dropout, both attention and residual (section 5.4), learning 
rates and beam size on the Section 22 development set; all other parameters remained unchanged from the English-to-German base translation model. 
During inference, we increased the maximum output length to input length + 300. We used a beam size of 21 and α = 0.3 for both the WSJ-only and the 
semi-supervised setting. Our results in Table 4 show that, despite the lack of task-specific tuning, our model performs surprisingly well, yielding better 
results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8]. In contrast to RNN sequence-to-sequence 
models [37], the Transformer outperforms the BerkeleyParser [29] even when training only on the WSJ training set of 40K sentences. 7 Conclusion In 
this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most 
commonly used in encoder-decoder architectures with multi-headed self-attention. For translation tasks, the Transformer can be trained significantly 
faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French 
translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles. We are 
excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving 
input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such 
as images, audio and video. Making generation less sequential is another research goal of ours. The code we used to train and evaluate our models is 
Attention Visualizations

[Figure 3 image: Input-Input Layer5 attention over the sentence "It is in this spirit that a majority of American governments have passed new laws since 2009 making the registration or voting process more difficult."]
Figure 3: An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for the word ‘making’. Different colors represent different heads. Best viewed in color.

[Figure 4 image: two Input-Input Layer5 panels over the sentence "The Law will never be perfect, but its application should be just - this is what we are missing, in my opinion."]
Figure 4: Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top: Full attentions for head 5. Bottom: Isolated attentions from just the word ‘its’ for attention heads 5 and 6. Note that the attentions are very sharp for this word.

[Figure 5 image: two Input-Input Layer5 panels over the sentence "The Law will never be perfect, but its application should be just - this is what we are missing, in my opinion."]
Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the sentence. We give two such examples above, from two different heads from the encoder self-attention at layer 5 of 6. The heads clearly learned to perform different tasks.


